Why is Automatic Recognition of Spontaneous Speech So Difficult?

Authors

  • Sadaoki Furui
  • Masanobu Nakamura
  • Koji Iwano
Abstract

Although read speech and similar styles, such as newspaper text read aloud or broadcast news, can be recognized with high accuracy, recognition accuracy drops drastically for spontaneous speech. The reason is that spontaneous speech differs significantly from read speech both acoustically and linguistically. This paper reports on the analysis and recognition of spontaneous speech using a large-scale spontaneous speech database, the Corpus of Spontaneous Japanese (CSJ). Recognition experiments show that accuracy increases significantly with the amount of training data for both the acoustic model and the language model, and that the improvement levels off at approximately 7M words of training data. This means that a very large corpus is needed to cover the huge linguistic and acoustic variation that occurs in spontaneous speech. Spectral analysis of various utterance styles in the CSJ shows that the spectral distribution of phonemes, and the spectral difference between them, are significantly reduced in spontaneous speech compared to read speech. Experimental results also show a strong correlation between the mean spectral distance between phonemes and phoneme recognition accuracy, which indicates that spectral reduction is one major reason for the lower recognition accuracy of spontaneous speech. A comparative analysis of statistical language models for written language (including newspaper articles) and for spontaneous speech shows significant differences in the observation frequency of each part of speech and in perplexity.
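
The perplexity comparison mentioned at the end of the abstract can be illustrated with a toy sketch. This is not the authors' language model or data: the unigram model, the add-one smoothing, the function names `train_unigram` and `perplexity`, and both example sentences are assumptions made purely for illustration. It shows only the general effect that a model trained on written-style text tends to assign higher perplexity to text with fillers and repetitions.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Unigram model with add-one smoothing (an illustrative assumption)."""
    counts = Counter(tokens)
    total = len(tokens)
    size = len(vocab)
    return {w: (counts[w] + 1) / (total + size) for w in vocab}


def perplexity(model, tokens):
    """Perplexity = exp of the negative mean log-probability per token."""
    log_sum = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_sum / len(tokens))


# Hypothetical toy data, not taken from the CSJ or any newspaper corpus.
written = "the market rose and the index rose".split()
spoken = "um the the market uh rose you know".split()

# A shared vocabulary keeps every test token inside the model.
vocab = set(written) | set(spoken)
model = train_unigram(written, vocab)

pp_written = perplexity(model, written)
pp_spoken = perplexity(model, spoken)
print(f"written-style perplexity: {pp_written:.2f}")
print(f"spoken-style perplexity:  {pp_spoken:.2f}")
```

A real comparison would use n-gram or neural models trained on large corpora and evaluated on held-out text; the sketch only makes the definition of perplexity concrete.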

Similar resources

A Database for Automatic Persian Speech Emotion Recognition: Collection, Processing and Evaluation

Recent developments in robotics automation have motivated researchers to improve the efficiency of interactive systems by making man-machine interaction more natural. Since speech is the most popular method of communication, recognizing human emotions from the speech signal has become a challenging research topic known as Speech Emotion Recognition (SER). In this study, we propose a Persian em...

Why Sentence Modality in Spontaneous Speech is More Difficult to Classify and why this Fact is not too bad for Prosody

"You crazy," said Max. It was either a statement or a question. "So you're our man, then," he said. It was half statement, half question. We show in this paper that the labeling of sentence modality in German, esp. of questions vs. non-questions, is more difficult for spon...

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the major obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

Automatic Spontaneous Speech Recognition for Punjabi Language Interview Speech Corpus

Automatic Speech Recognition offers a natural means of communication between humans and machines. The purpose of a speech recognition system is to convert a sequence of sound units into text. The main objective of this research is to develop an automatic spontaneous speech model for the Punjabi language. Punjabi is categorized as a constituent of the Indo-Ar...

Why is Speech Recognition Difficult?

In this paper we will elaborate on some of the difficulties with Automatic Speech Recognition (ASR). We will argue that the main motivation for ASR is efficient interfaces to computers, and that for these interfaces to be truly useful, they should provide coverage for a large group of users. We will discuss some of the issues that make the recognition of a single speaker difficult and then extend the di...

Publication date: 2006